在时间差异增强学习算法中,价值估计的差异会导致最大目标值的不稳定性和高估。已经提出了许多算法来减少高估,包括最近的几种集合方法,但是,没有通过解决估计方差作为高估的根本原因来表现出样品效率学习的成功。在本文中,我们提出了一种简单的集合方法,将目标值估计为集合均值。尽管它很简单,但卑鄙的(还是在Atari学习环境基准测试的实验中显示出明显的样本效率)。重要的是,我们发现大小5的合奏充分降低了估计方差以消除滞后目标网络,从而消除了它作为偏见的来源并进一步获得样本效率。我们以直观和经验的方式为曲线的设计选择证明了合理性,包括独立经验抽样的必要性。在一组26个基准ATARI环境中,曲线均优于所有经过测试的基线,包括最佳的基线,日出,在16/26环境中的100K交互步骤,平均为68​​%。在21/26的环境中,曲线还优于500k步骤的Rainbow DQN,平均为49%,并使用200K($ \ pm $ 100k)的交互步骤实现平均人级绩效。我们的实施可从https://github.com/indylab/meanq获得。
translated by 谷歌翻译
强大的增强学习(RL)考虑了在一组可能的环境参数值中最坏情况下表现良好的学习政策的问题。在现实世界环境中,选择可靠RL的可能值集可能是一项艰巨的任务。当指定该集合太狭窄时,代理将容易受到不称职的合理参数值的影响。如果规定过于广泛,则代理商将太谨慎。在本文中,我们提出了可行的对抗性鲁棒RL(FARR),这是一种自动确定环境参数值集的方法。 Farr隐式将可行的参数值定义为代理可以在足够的培训资源的情况下获得基准奖励的参数值。通过将该问题作为两人零和游戏的配方,Farr共同学习了对参数值的对抗分布,并具有可行的支持,并且在此可行参数集中进行了强大的策略。使用PSRO算法在这款FARR游戏中找到近似的NASH平衡,我们表明,接受FARR训练的代理人对可行的对抗性参数选择比现有的minimax,domain randanmization,域名和遗憾的目标更强大控制环境。
translated by 谷歌翻译
在竞争激烈的两种环境中,基于\ emph {double oracle(do)}算法的深度强化学习(RL)方法,例如\ emph {policy space响应oracles(psro)}和\ emph {任何时间psro(apsro)},迭代地将RL最佳响应策略添加到人群中。最终,这些人口策略的最佳混合物将近似于NASH平衡。但是,这些方法可能需要在收敛之前添加所有确定性策略。在这项工作中,我们介绍了\ emph {selfplay psro(sp-psro)},这种方法可在每次迭代中的种群中添加大致最佳的随机策略。SP-PSRO并不仅对对手的最少可剥削人口混合物添加确定性的最佳反应,而是学习了大致最佳的随机政策,并将其添加到人群中。结果,SPSRO从经验上倾向于比APSRO快得多,而且在许多游戏中,仅在几次迭代中收敛。
translated by 谷歌翻译
软演员 - 评论家(SAC)被认为是连续动作空间设置中的最先进的算法。它使用最大熵框架进行效率和稳定性,并应用启发式温度拉格朗日术语来调整温度$ \ Alpha $,这决定了策略应该如何“软”。经验证据表明SAC在离散域中表现不佳是反直观的。在本文中,我们研究了这种现象的可能解释,并提出了靶熵调度囊(TES-囊),用于施加在囊上的靶熵参数的退火方法。目标熵是温度拉格朗日术语中的常数,表示离散囊中的目标政策熵。我们将我们的方法与不同常数目标熵囊的Atari 2600游戏进行比较,并分析我们的调度如何影响囊。
translated by 谷歌翻译
最大熵增强学习(MaxEnt RL)算法,如软Q-Learning(SQL)和软演员 - 评论家权衡奖励和政策熵,有可能提高培训稳定性和鲁棒性。然而,大多数最大的RL方法使用恒定的权衡系数(温度),与温度应该在训练早期高的直觉相反,以避免对嘈杂的价值估算和减少培训后,我们越来越多地信任高价值估计,避免危险的估算和减少导致好奖励。此外,我们对价值估计的置信度是国家依赖的,每次使用更多证据来更新估算时都会增加。在本文中,我们提出了一种简单的状态温度调度方法,并将其实例化为基于计数的软Q学习(CBSQL)。我们在玩具领域以及在几个Atari 2600域中评估我们的方法,并显示有前途的结果。
translated by 谷歌翻译
Reinforcement learning (RL) algorithms involve the deep nesting of highly irregular computation patterns, each of which typically exhibits opportunities for distributed computation. We argue for distributing RL components in a composable way by adapting algorithms for top-down hierarchical control, thereby encapsulating parallelism and resource requirements within short-running compute tasks. We demonstrate the benefits of this principle through RLlib: a library that provides scalable software primitives for RL. These primitives enable a broad range of algorithms to be implemented with high performance, scalability, and substantial code reuse. RLlib is available as part of the open source Ray project 1 .
translated by 谷歌翻译
Unsupervised learning-based anomaly detection in latent space has gained importance since discriminating anomalies from normal data becomes difficult in high-dimensional space. Both density estimation and distance-based methods to detect anomalies in latent space have been explored in the past. These methods prove that retaining valuable properties of input data in latent space helps in the better reconstruction of test data. Moreover, real-world sensor data is skewed and non-Gaussian in nature, making mean-based estimators unreliable for skewed data. Again, anomaly detection methods based on reconstruction error rely on Euclidean distance, which does not consider useful correlation information in the feature space and also fails to accurately reconstruct the data when it deviates from the training distribution. In this work, we address the limitations of reconstruction error-based autoencoders and propose a kernelized autoencoder that leverages a robust form of Mahalanobis distance (MD) to measure latent dimension correlation to effectively detect both near and far anomalies. This hybrid loss is aided by the principle of maximizing the mutual information gain between the latent dimension and the high-dimensional prior data space by maximizing the entropy of the latent space while preserving useful correlation information of the original data in the low-dimensional latent space. The multi-objective function has two goals -- it measures correlation information in the latent feature space in the form of robust MD distance and simultaneously tries to preserve useful correlation information from the original data space in the latent space by maximizing mutual information between the prior and latent space.
translated by 谷歌翻译
The usage of technologically advanced devices has seen a boom in many domains, including education, automation, and healthcare; with most of the services requiring Internet connectivity. To secure a network, device identification plays key role. In this paper, a device fingerprinting (DFP) model, which is able to distinguish between Internet of Things (IoT) and non-IoT devices, as well as uniquely identify individual devices, has been proposed. Four statistical features have been extracted from the consecutive five device-originated packets, to generate individual device fingerprints. The method has been evaluated using the Random Forest (RF) classifier and different datasets. Experimental results have shown that the proposed method achieves up to 99.8% accuracy in distinguishing between IoT and non-IoT devices and over 97.6% in classifying individual devices. These signify that the proposed method is useful in assisting operators in making their networks more secure and robust to security breaches and unauthorized access.
translated by 谷歌翻译
Multiple studies have focused on predicting the prospective popularity of an online document as a whole, without paying attention to the contributions of its individual parts. We introduce the task of proactively forecasting popularities of sentences within online news documents solely utilizing their natural language content. We model sentence-specific popularity forecasting as a sequence regression task. For training our models, we curate InfoPop, the first dataset containing popularity labels for over 1.7 million sentences from over 50,000 online news documents. To the best of our knowledge, this is the first dataset automatically created using streams of incoming search engine queries to generate sentence-level popularity annotations. We propose a novel transfer learning approach involving sentence salience prediction as an auxiliary task. Our proposed technique coupled with a BERT-based neural model exceeds nDCG values of 0.8 for proactive sentence-specific popularity forecasting. Notably, our study presents a non-trivial takeaway: though popularity and salience are different concepts, transfer learning from salience prediction enhances popularity forecasting. We release InfoPop and make our code publicly available: https://github.com/sayarghoshroy/InfoPopularity
translated by 谷歌翻译
Almost 80 million Americans suffer from hair loss due to aging, stress, medication, or genetic makeup. Hair and scalp-related diseases often go unnoticed in the beginning. Sometimes, a patient cannot differentiate between hair loss and regular hair fall. Diagnosing hair-related diseases is time-consuming as it requires professional dermatologists to perform visual and medical tests. Because of that, the overall diagnosis gets delayed, which worsens the severity of the illness. Due to the image-processing ability, neural network-based applications are used in various sectors, especially healthcare and health informatics, to predict deadly diseases like cancers and tumors. These applications assist clinicians and patients and provide an initial insight into early-stage symptoms. In this study, we used a deep learning approach that successfully predicts three main types of hair loss and scalp-related diseases: alopecia, psoriasis, and folliculitis. However, limited study in this area, unavailability of a proper dataset, and degree of variety among the images scattered over the internet made the task challenging. 150 images were obtained from various sources and then preprocessed by denoising, image equalization, enhancement, and data balancing, thereby minimizing the error rate. After feeding the processed data into the 2D convolutional neural network (CNN) model, we obtained overall training accuracy of 96.2%, with a validation accuracy of 91.1%. The precision and recall score of alopecia, psoriasis, and folliculitis are 0.895, 0.846, and 1.0, respectively. We also created a dataset of the scalp images for future prospective researchers.
translated by 谷歌翻译